Approach

Unconstrained Optimization

  1. Use a combination of prediction and interpolation to predict all values of time and accuracy under different hardware/software configurations.
  2. Find the combination of algorithm inputs that maximizes accuracy. If there are ties, break them by choosing the point that requires the least data.
  3. Find the costs associated with running the algorithm with those inputs on all different hardware configurations.
  4. Find the combination of hardware that jointly minimizes cost and time.

Data-constrained Optimization

  1. Use a combination of prediction and interpolation to predict all values of time and accuracy under different hardware/software configurations.
  2. Subset the accuracy surface produced above to the amount of data available. The maximizing point will fall in the upper right corner of the subsetted space.
  3. Find the costs associated with running the algorithm with the accuracy-maximizing point on all different hardware configurations.
  4. Find the combination of hardware that jointly minimizes time and cost.

Cost-constrained Optimization

  1. Use a combination of prediction and interpolation to predict all values of time and accuracy under different hardware/software configurations.
  2. Subset the space produced above to the amount of time and money available to spend on modelling.
  3. Working backwards now, find the accuracies that can be produced in the limited time.
  4. Using the subset of accuracy space, find the combination of algorithm inputs that maximizes accuracy.
knitr::opts_chunk$set(cache=TRUE, echo=F, warning=F, error = F, message=F)
knitr::opts_knit$set(root.dir = "/users/scottsfarley/documents")
setwd("/users/scottsfarley/documents")
library(parallel)
library(doParallel)
library(akima)
library(ggplot2)
options(java.parameters = "-Xmx1500m")
library(bartMachine)
## Welcome to bartMachine v1.2.3! You have 1.4GB memory available.
library(reshape2)
library(ggdendro)

threshold.time <- 22 ##seconds
threshold.cost <- 30 ##cents
threshold.numTex <- 45

First, get the training data and fit the model. Perform some skill checks on it.

## serializing in order to be saved for future R sessions...done
## serializing in order to be saved for future R sessions...done

Choose a finite number of possible solutions to the model. Ideally, we would want every single combination of predictor variables on [0, Inf]. This is obviously intractable, and I only have data for a subset of that space anyway. So randomly sample the subspace in which I have data to make the problem possible to solve.
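The sampling step can be sketched in a few lines of base R; the ranges and column names below are illustrative stand-ins, not the exact values used in the experiments.

```r
set.seed(42)
n <- 500  ## number of candidate configurations to sample

## Sample only within the subspace covered by the experiments; these
## ranges are assumptions for illustration, not the observed ranges.
candidates <- data.frame(
  cores            = sample(1:24, n, replace = TRUE),
  GBMemory         = sample(c(2, 4, 8, 16, 22), n, replace = TRUE),
  trainingExamples = sample(100:10000, n, replace = TRUE),
  covariates       = sample(2:6, n, replace = TRUE)
)
```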

Using that subset of data and the models we fit previously, predict each candidate configuration of algorithm inputs and hardware variables for execution time and SDM accuracy.
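A minimal sketch of the prediction step. The names `time.model` and `accuracy.model` are assumptions, and `lm` fits on synthetic data stand in for the BART models fit above so the sketch runs end-to-end.

```r
set.seed(7)
## Candidate grid of hardware and algorithm inputs (illustrative values)
grid <- expand.grid(cores = c(2, 8, 16),
                    GBMemory = c(2, 22),
                    trainingExamples = c(1000, 5000, 10000),
                    covariates = c(2, 4, 6))

## Synthetic "observed" runs used to fit the stand-in surrogate models
sim <- grid
sim$seconds  <- 40 / sim$cores + sim$trainingExamples / 400 + rnorm(nrow(sim), 0, 0.5)
sim$accuracy <- 0.6 + 2e-5 * sim$trainingExamples + 0.01 * sim$covariates

time.model     <- lm(seconds ~ cores + GBMemory + trainingExamples + covariates, data = sim)
accuracy.model <- lm(accuracy ~ trainingExamples + covariates, data = sim)

## Predict execution time and SDM accuracy for every candidate configuration
grid$seconds  <- predict(time.model, newdata = grid)
grid$accuracy <- predict(accuracy.model, newdata = grid)
```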

Plot the posterior means of the accuracy models against the algorithm inputs that should control accuracy. In this case, these are number of training examples and number of covariates.

The accuracy clearly varies from low (few training examples, few covariates) to very high (many training examples, many covariates). More data might sharpen this surface, but we work with what we have. Our task is to find the combination of inputs that results in the highest-accuracy model. If there's a tie, find the combination that needs the least data.
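The maximize-then-tie-break rule reduces to a subset and an ordering; the toy values here are illustrative.

```r
## Tie-break sketch: maximize accuracy, then prefer the least data
preds <- data.frame(trainingExamples = c(5000, 8000, 10000),
                    covariates       = c(3, 5, 5),
                    accuracy         = c(0.80, 0.85, 0.85))

tied <- preds[preds$accuracy == max(preds$accuracy), ]  ## rows tied at max accuracy
best <- tied[which.min(tied$trainingExamples), ]        ## least data among the ties
```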

Now, we know the combination of algorithm inputs that results in the highest accuracy. The figure below shows the identified combination on the training-examples and covariates axes. This combination of training examples and number of covariates can be run on any combination of hardware, though some will be suboptimal. Thus, at this point, we've solved half of our challenge: the algorithm inputs have been optimized; now it's time to optimize the hardware.

## [1] "Accuracy is maximized at 10000 training examples and 5 predictors."

In theory, the hardware parameters should not affect SDM accuracy. We can test this assumption here by plotting the accuracies obtained for this combination of algorithm inputs against the number of CPUs and amount of memory. If the assumption is valid, the plot should show no change in either the horizontal or vertical direction. We see that there is, in fact, some change. This is likely due to the experimental design, which lacked a full factorial setup. The effect is relatively minor, so I note it and move on.
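The check itself is just the range of predicted accuracy across hardware at fixed algorithm inputs; here a simulated surface with a small amount of noise stands in for the model predictions.

```r
set.seed(3)
## Predicted accuracy across hardware at the fixed accuracy-maximizing
## inputs; the small wobble is a simulated stand-in for the model output
surface <- expand.grid(cores = 1:24, GBMemory = c(2, 4, 8, 22))
surface$accuracy <- 0.85 + rnorm(nrow(surface), 0, 0.005)

acc.range <- diff(range(surface$accuracy))
## under a perfectly hardware-independent model, acc.range would be 0
```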

## [1] "Accuracy Range on Hardware:  0.0224801204365388"
## [1] "Accuracy Range from Expectation:  0"
## [1] "------"
## [1] "Fixing accuracy at:  0.852327149873655"

Now, fix the algorithm inputs at the accuracy-maximizing point, effectively fixing the expected model accuracy. An algorithm with these inputs can be run on any combination of hardware. Project how long that specific model would take, and how much it would cost, on all computing types. Plot those out on time vs. cost axes.

The optimal solution is the one that balances time and cost equally during the minimization. We use Euclidean distance on axes normalized by their standard deviations, so the two dimensions are weighted equally. For each candidate combination of hardware, we calculate the distance between it and the origin of these two axes. We then find the minimum of those distances and call that point the optimal.
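A sketch of the distance calculation, with toy values that loosely mirror the tables below but are illustrative only.

```r
## Candidate hardware with projected time (seconds) and cost (cents)
hw <- data.frame(cores   = c(5, 10, 13, 17, 22),
                 seconds = c(27.4, 23.3, 21.9, 22.6, 31.2),
                 cents   = c(4.1, 7.0, 8.5, 11.0, 157.9))

## Scale each axis by its standard deviation so time and cost contribute
## equally, then take the candidate nearest the origin
d <- sqrt((hw$seconds / sd(hw$seconds))^2 + (hw$cents / sd(hw$cents))^2)
optimal <- hw[which.min(d), ]
```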

Our job is complete. We've now optimized both the hardware and software dimensions of the problem.

## [1] "------RANDOM FOREST OPTIMAL--------"
## [1] "Predicted Optimal Accuracy 0.852327149873655 +/- 0"
## [1] "Predicted Optimal Cost (seconds) 21.8684377437557"
## [1] "Predicted Optimal Cost (cents) 8.54574810150484"
## [1] "Cores:  13"
## [1] "Memory: 2"
## [1] "Training Examples: 10000"
## [1] "Covariates: 5"

Everything up to this point was done using the mean of the posterior distribution, a choice that simplifies the process but loses information and may produce over-confident predictions. We can modify our steps to include information from the entire posterior, which may solve this issue.

Instead of projecting just the mean time and mean cost for use in the distance minimization, use the entire set of posterior samples. Calculate the distance metric for each sample in the posterior independently. You're then left with a density distribution of distances, from which we can infer the minimum value.
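A sketch of the per-sample distance calculation for a single hardware candidate, with simulated normal draws standing in for the BART posterior samples.

```r
set.seed(11)
n.samples <- 1000

## Simulated posterior draws of (seconds, cents) for one candidate;
## in the analysis these come from the BART posterior
post.seconds <- rnorm(n.samples, mean = 21.9, sd = 3.0)
post.cents   <- rnorm(n.samples, mean = 8.5,  sd = 1.5)

## One distance per posterior draw, each axis scaled by its sd
d.samples <- sqrt((post.seconds / sd(post.seconds))^2 +
                  (post.cents   / sd(post.cents))^2)

distance.mean <- mean(d.samples)  ## summarize the distance distribution
distance.sd   <- sd(d.samples)
```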

The posteriors are in a line, since there’s a fixed linear relationship between time and cost.

Now, find the distance metrics for all of those points.

There’s a lot of overlap in this figure, and many points are far away from the optimal. We don’t care about those. Take the few closest to the minimum and look at their distributions.
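The selection step reduces to sorting the mean distances and keeping the closest few; the named vector of mean distances is an illustrative stand-in.

```r
## Mean distance per candidate configuration, named by config id
d.means <- c(`145` = 24.85, `157` = 25.03, `169` = 25.20,
             `37`  = 39.93, `25`  = 51.97)

k <- 3
nearest <- sort(d.means)[1:k]  ## the k candidates closest to the minimum
```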

Now, the optimal configuration may be one of the following:

config cores GBMemory seconds cost distance.mean distance.sd
145 13 2 21.86844 8.545748 24.84752 8.630945
157 14 2 21.78000 9.165896 25.03071 8.703743
169 15 2 21.71175 9.789829 25.19974 8.699931
133 12 2 22.60012 8.152314 25.38249 8.675544
121 11 2 22.74969 7.522412 25.38255 9.059256
181 16 2 21.73022 10.451368 25.52088 8.824121
109 10 2 23.31605 7.008804 25.83321 9.269416
193 17 2 21.50065 10.987265 25.85966 9.837260
204 17 22 31.88682 124.926831 25.94901 10.024839
216 18 22 31.48957 130.627550 26.40813 10.404463
228 19 22 31.54489 138.126864 26.59251 10.459257
97 9 2 24.22601 6.554105 26.64450 9.645858
240 20 22 31.34982 144.497609 26.94817 10.606178
252 21 22 31.35484 151.746748 27.16810 10.622984
264 22 22 31.15261 157.947470 27.30016 10.789988
276 23 22 30.79635 163.238542 27.92258 11.218416
73 7 2 25.85374 5.440143 28.23476 10.486264
61 6 2 26.18430 4.722601 28.33462 10.266267
49 5 2 27.41620 4.120655 29.55640 10.792338
85 8 2 28.01131 6.736160 31.04108 12.023291
37 4 2 37.14908 4.466805 39.93313 15.138660
146 13 4 36.47393 23.755472 44.76697 10.320824
98 9 4 40.77590 18.385853 46.02494 10.745413
205 18 2 21.25743 11.501970 49.37896 13.920546
25 3 2 48.52713 4.376176 51.96672 19.640752

config cores GBMemory seconds cost distance.mean distance.sd cluster
145 13 2 21.86844 8.545748 24.84752 8.630945 1
157 14 2 21.78000 9.165896 25.03071 8.703743 1
169 15 2 21.71175 9.789829 25.19974 8.699931 1
133 12 2 22.60012 8.152314 25.38249 8.675544 1
121 11 2 22.74969 7.522412 25.38255 9.059256 1
181 16 2 21.73022 10.451368 25.52088 8.824121 1
109 10 2 23.31605 7.008804 25.83321 9.269416 1
193 17 2 21.50065 10.987265 25.85966 9.837260 1
97 9 2 24.22601 6.554105 26.64450 9.645858 1
In the results above, you're actually seeing the tradeoff between time and money play out quite nicely. Adding cores costs money but, in the case of random forests, reduces time. Here, that tradeoff almost exactly evens out.

Cost-Constrained Optimization

There are two main types of constraints on this optimization problem: (1) limited time and/or money and (2) limited data. In the first case, the researcher only has so much time or money (or both) available to spend on modelling. She must optimize her workflow so that she gets the most out of the limited funds available to her. For a single experiment, this doesn’t really hold water, since only cents and seconds are being spent on computing the models. But when we think about global syntheses with many species, these add up to significant expenditures.

In this simple example, the researcher has a hard maximum of 10 seconds and 25 cents to be spent on the model.

First, calculate a surface of all the possible configurations, whether or not they meet her threshold.

Now, subset that surface, retaining only the configurations that satisfy her thresholds in both time and money. Find the maximum amount of data that can be used, and calculate all the possible accuracies that could be achieved with it. This down-weights expensive computing types and encourages the solution to have high accuracy.
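The budget-subsetting step can be sketched as follows; the thresholds follow the scenario above (10 seconds, 25 cents), and the surface values are illustrative.

```r
threshold.time <- 10  ## seconds
threshold.cost <- 25  ## cents

## Predicted surface over candidate configurations (illustrative values)
surface <- data.frame(trainingExamples = c(500, 1951, 4000, 10000),
                      covariates = c(3, 4, 5, 5),
                      seconds  = c(3.0, 5.5, 12.0, 40.0),
                      cents    = c(1.0, 1.7, 6.0, 30.0),
                      accuracy = c(0.70, 0.81, 0.84, 0.85))

## Keep only configurations under both budgets, then maximize accuracy
feasible <- surface[surface$seconds <= threshold.time &
                    surface$cents   <= threshold.cost, ]
best <- feasible[which.max(feasible$accuracy), ]
```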

## [1] "There are  6776 candidates."

Notice the change in scales. We’re no longer able to reach a point with more than 4000 training examples. Instead, we’re limited to less data, and lower accuracy, because of the time/money constraint.

## [1] "Accuracy is maximized at 1951 training examples and 4 predictors."

## [1] "Expected accuracy in this scenario is:  0.813640482335581"
## [1] "Fixing training examples at:  1951"
## [1] "Fixing covariates at:  4"

Finally, come back around, and find the computing hardware that’s best for these inputs.

## [1] "Now there are only:  287 candidates, instead of  287 candidates that can complete this scenario under budget."

## [1] "Recommended # cores:  10"
## [1] "Recommended Memory:  2"
## [1] "Expected Cost:  1.66502461445852"
## [1] "Expected Seconds:  5.53900404011484"

config cores GBMemory seconds cost distance.mean distance.sd
109 10 2 5.539004 1.665025 6.350774 3.279545
133 12 2 5.475323 1.975059 6.367236 3.255424
121 11 2 5.531121 1.828920 6.397116 3.329692
97 9 2 5.632587 1.523840 6.411663 3.271452
145 13 2 5.477217 2.140387 6.436693 3.288498
157 14 2 5.466670 2.300594 6.504559 3.436027
169 15 2 5.684990 2.563362 6.831951 3.411730
61 6 2 6.745027 1.216533 7.259893 2.530495
49 5 2 7.030630 1.056704 7.525741 2.578269
73 7 2 6.923801 1.456906 7.616173 3.257747
193 17 2 7.162181 3.660018 8.652961 3.825849
204 17 22 16.447625 64.438835 8.719005 3.925326
252 21 22 16.106365 77.949329 9.204787 4.317303
264 22 22 16.075799 81.506229 9.373899 4.350329
276 23 22 16.100316 85.341015 9.801753 5.010196

config cores GBMemory seconds cost distance.mean distance.sd cluster
109 10 2 5.539004 1.665025 6.350774 3.279545 1
133 12 2 5.475323 1.975059 6.367236 3.255424 1
121 11 2 5.531121 1.828920 6.397116 3.329692 1
97 9 2 5.632587 1.523840 6.411663 3.271452 1
145 13 2 5.477217 2.140387 6.436693 3.288498 1
157 14 2 5.466670 2.300594 6.504559 3.436027 1
169 15 2 5.684990 2.563362 6.831951 3.411730 1
73 7 2 6.923801 1.456906 7.616173 3.257747 1

Data-Constrained Optimization

In this case, we’ve got a constraint on the amount of data available to us.
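A sketch of the data-constrained subsetting, using the `threshold.numTex` value set above; the surface values are illustrative.

```r
threshold.numTex <- 45  ## maximum training examples available

## Predicted accuracy surface over algorithm inputs (illustrative values)
surface <- data.frame(trainingExamples = c(10, 45, 100, 1000),
                      covariates = rep(5, 4),
                      accuracy   = c(0.70, 0.786, 0.81, 0.85))

## Cap training examples at the data threshold, then maximize accuracy
## over the remaining (upper-right) corner of the surface
feasible <- surface[surface$trainingExamples <= threshold.numTex, ]
best <- feasible[which.max(feasible$accuracy), ]
```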

## [1] "Current data threshold is  45"
## [1] "Accuracy is maximized at 1 training examples and 5 predictors."
## [1] "Expected Max Accuracy is  0.78638064693302"
## [1] "Now there are only:  287 candidates, instead of  287 candidates that can complete this scenario under budget."

## [1] "Recommended # cores:  10"
## [1] "Recommended Memory:  2"
## [1] "Expected Cost:  1.42441508036156"
## [1] "Expected Seconds:  4.7385731216286"

config cores GBMemory seconds cost distance.mean distance.sd
109 10 2 4.738573 1.4244151 5.274748 2.201935
97 9 2 4.807257 1.3005552 5.298139 2.156391
133 12 2 4.677079 1.6871158 5.298231 2.211611
121 11 2 4.732381 1.5648092 5.315057 2.235309
145 13 2 4.683376 1.8301697 5.353266 2.217866
157 14 2 4.679844 1.9694656 5.411921 2.308578
169 15 2 4.885003 2.2026478 5.706941 2.324732
61 6 2 5.679106 1.0242836 5.988099 1.690359
49 5 2 5.949659 0.8942338 6.257268 1.810588
193 17 2 6.802014 3.4759651 8.178757 3.307086
228 19 22 14.150467 61.9612142 8.357764 3.573538
181 16 2 7.223502 3.4742153 8.569679 3.562575
252 21 22 14.021231 67.8579910 8.579634 3.503562
240 20 22 14.040894 64.7172885 8.621569 3.573795
264 22 22 14.003878 71.0013403 8.730317 3.526462

config cores GBMemory seconds cost distance.mean distance.sd cluster
109 10 2 4.738573 1.4244151 5.274748 2.201935 1
97 9 2 4.807257 1.3005552 5.298139 2.156391 1
133 12 2 4.677079 1.6871158 5.298231 2.211611 1
121 11 2 4.732381 1.5648092 5.315057 2.235309 1
145 13 2 4.683376 1.8301697 5.353266 2.217866 1
157 14 2 4.679844 1.9694656 5.411921 2.308578 1
169 15 2 4.885003 2.2026478 5.706941 2.324732 1
61 6 2 5.679106 1.0242836 5.988099 1.690359 1
49 5 2 5.949659 0.8942338 6.257268 1.810588 1